The ability to generalize is an important feature of any intelligent agent, not only because it
may allow the agent to cope with large amounts of data, but also because in some environments
an agent with no generalization ability is simply doomed to fail. In this work we outline several
criteria for generalization, and present a dynamic and autonomous machinery that enables projective
simulation agents to meaningfully generalize. Projective simulation, a novel, physical approach to
artificial intelligence, was recently shown to perform well, in comparison with standard models, on
both simple reinforcement learning problems and more complicated canonical tasks, such
as the “grid world” and the “mountain car problem”. Both the basic projective simulation model
and the presented generalization machinery are based on very simple principles. This simplicity
allows us to provide a full analysis of the agent’s performance and to illustrate the benefit
the agent gains by generalizing. Specifically, we show how such an ability allows the agent to learn
in rather extreme environments, in which learning is otherwise impossible.


I. INTRODUCTION 

The ability to act upon a new stimulus, based on previous
experience with similar, but distinct, stimuli, sometimes
denoted as generalization, is used extensively in
our daily life. As a simple example, consider a driver’s
response to traffic lights: The driver need not recognize
the details of a particular traffic light in order to respond
to it correctly, even though traffic lights may appear different
from one another. The only property that matters
is the color, whereas neither shape nor size should play
any role in the driver’s reaction. Learning how to react
to traffic lights thus involves an aspect of generalization.

A learning agent capable of meaningful and useful
generalization is expected to have the following characteristics:
(a) an ability for categorization (recognizing
that all red signals have a common property, which we
can refer to as redness); (b) an ability to classify (a new
red object is to be related to the group of objects with
the redness property); (c) ideally, only generalizations
that are relevant for the success of the agent should be
learned (red signals should be treated the same, whereas
square signals should not, as they share no property
that is of relevance in this context); (d) correct actions
should be associated with relevant generalized properties
(the driver should stop whenever a red signal is shown);
and (e) the generalization mechanism should be flexible.

To illustrate what we mean by “flexible generalization”,
let us go back to our driver. After learning how
to handle traffic lights correctly, the driver tries to follow
arrow signs to, say, a nearby airport. Clearly, it is
now the shape category of the signal that should guide
the driver, rather than the color category. The situation
would be even more confusing if the traffic signalization
were suddenly based on the shape category
alone: square lights mean “stop” whereas circular lights
mean “drive”. To adapt to such environmental changes
the driver has to give up the old color-based generalization
and build up a new, shape-based generalization.
Generalizations must therefore be flexible.

In reinforcement learning (RL), where an agent learns
via interaction with a rewarding environment, generalization
is often used as a technique to reduce the size
of the percept space, which is potentially very large. For
example, in the Q-learning [3] and SARSA algorithms, 
it is common to use function approximation methods [Il¬ 
ls], realized by e.g. tile coding (CMAC) [BHH], neural net¬ 
works [siisHn], or support vector machines mill US], 
to implement a generalization mechanism. More modern 
approaches include filtered Q iteration m and decision 
trees (RL-DT) [15]. Alternatively, in learning classifier 
systems (LCS), generalization is facilitated by using the
wildcard character #, which, roughly speaking, means
that a particular category is irrelevant for the present
task environment [MIH]. 

In this paper we introduce a notion of generalization
into the recently developed model of projective simulation
(PS) [IS]. PS is a physical approach to artificial
intelligence which is based on stochastic processing of
experience. It uses a particular type of memory, denoted
as episodic & compositional memory (ECM), which is
structured as a directed, weighted network of clips, where
each clip represents a remembered percept, action, or
sequences thereof. Once a percept is observed, the network
is activated, invoking a random walk between the clips,
until an action clip is hit and couples out as a real action
of the agent.

Using random walks as the basic processing step is motivated
from different perspectives: First, random walks
have been well-studied in the context of randomized
algorithm theory [5D] and probability theory, thus
providing an extensive theoretical toolbox for analyzing
related models; second, it provides a scheme for possible
physical (rather than computational) realizations,
thereby relating the model to the framework of embodied
artificial agents [22]; and last, by employing known
methods for quantum walks, the basic random walk process
can be quantized, leading to potentially dramatic
improvements in the agent’s performance. For instance,
it was recently shown, by some of the authors and collaborators,
that a quantum variant of the PS agent exhibits a
quadratic speed-up in deliberation time over its classical
analogue, which leads to a similar speed-up of learning
time in active learning scenarios [23].

In the PS model, learning is realized by internal modification
of the clip network, both in terms of its structure
and the weights of its edges. Through interactions with
a rewarding environment, the clip network adjusts itself
dynamically, so as to increase the probability of performing
better in subsequent time steps (see below for a more
detailed description of the model). Learning is thus based
on a “trial and error” approach, making the PS model
especially suitable for solving RL tasks. Indeed, recent
studies showed that the PS agent can perform very well in
comparison to standard models, in both basic RL problems
[23] and in more sophisticated tasks, such as the
“grid-world” and the “mountain car problem” [25].

Here we present a simple dynamical mechanism which
allows the PS network to evolve, through experience, to
a network that represents and exploits similarities in the
perceived percepts, i.e. to a network that can generalize.
This mechanism, which is inspired by the wildcard notion
of LCS [TMT5], is based on a process of abstraction
which is systematic, autonomous, and, most importantly,
requires no explicit prior knowledge on the part of the agent. This
is in contrast with common generalization methods in
RL, such as function approximation, which often require
additional input [S]. Moreover, we show that once the
PS agent is provided with this machinery, which allows it
to both categorize and classify, the rest of the expected
characteristics we listed above follow directly from the
basic learning rules of the PS agent. In particular, we
show that relevant generalizations are learned, that the
agent associates correct actions with generalized properties,
and that the entire generalization scheme is flexible, as
required.

While in most RL literature elements of generalization
are considered as a means of tackling the “curse of dimensionality”
[3], as coined by Bellman [26] and discussed
above, they are also necessary for an agent to learn in
certain environments [T]. Here we consider such environments
where, irrespective of its available resources, an
agent with no generalization ability cannot learn, i.e. it
performs no better than a fully random agent.

Following this, we show that the PS model, when enhanced
with the generalization mechanism, is capable of
learning in such an environment. Alongside numerical illustrations
we provide a detailed analytical description of
the agent’s performance, with respect to its success and
learning rates (defined below). Such an analysis is feasible
due to the simplicity of the PS model, both in terms
of the number of its free parameters and its underlying
equations (see also [23]), a property we extensively
exploit.

The paper is structured as follows: Section II begins,
for completeness, with a short description of the
PS model. Section III presents the proposed generalization
mechanism, examines its performance in a simple
case and illustrates how it gives rise to a meaningful
generalization, as defined above. In Section IV we study the
central scenario of this paper, in which generalization is
an absolute condition for learning. After describing the
scenario and showing that the PS agent can cope with it
(Section IV A), we analyze its performance analytically
(Section IV B). Last, in Section IV C we study this scenario
for an arbitrary number of categories, and observe that
the more there is to categorize, the more beneficial is the
proposed mechanism. Section V concludes the paper.


II. THE PS MODEL 

In what follows we briefly summarize the basic principles
of the PS model; for more detailed descriptions we
refer the reader to references [BUI].

The central component of the PS agent is the so-called
clip network, which can, abstractly, be represented as a
directed graph, where each node is a clip, and directed
edges represent allowed transitions, as depicted in Fig. 1.
Whenever the PS agent perceives an input, the corresponding
percept clip is excited (e.g. Clip 1 in Fig. 1).
This excitation marks the beginning of a random walk
between the clips until an action clip is hit (e.g. Clip 6 in
Fig. 1), and the corresponding action is performed. The
random walk is carried out according to time-dependent
probabilities $p_{ij}$ to hop from one node to another.



Formally, percept clips are defined as K-tuples
$s = (s_1, s_2, \ldots, s_K) \in S \equiv S_1 \times S_2 \times \cdots \times S_K$,
$s_i \in \{1, \ldots, |S_i|\}$, where $|S| = |S_1| \cdots |S_K|$
is the number of possible percepts. Each dimension may account for a different type
of perceptual input such as audio, visual, or sensational,
where the exact specification (number of dimensions K
and the perceptual type of each dimension) and resolution
(the size $|S_i|$ of each dimension) depend on the
physical realization of the agent. In what follows, we
regard each of the K dimensions as a different category.
Action clips are similarly given as M-tuples:
$a = (a_1, a_2, \ldots, a_M) \in A \equiv A_1 \times A_2 \times \cdots \times A_M$,
$a_i \in \{1, \ldots, |A_i|\}$, where $|A| = |A_1| \cdots |A_M|$ is the number of
possible actions. Once again, each of the M dimensions
provides a different aspect of an action, e.g. walking,
jumping, picking-up, etc. Here, however, we restrict our
analysis to the case of M = 1 and varying $|A_1|$.

Each directed edge from clip $c_i$ to clip $c_j$ has a
time-dependent weight $h^{(t)}(c_i, c_j)$, which we call the h-value.
The h-values define the conditional probabilities of hopping
from clip $c_i$ to clip $c_j$ according to

$$p^{(t)}(c_j \mid c_i) = \frac{h^{(t)}(c_i, c_j)}{\sum_k h^{(t)}(c_i, c_k)}. \qquad (1)$$

At the beginning, all h-values are initialized to the same
fixed value $h_0 > 0$, where we usually set $h_0 = 1$. This
ensures that, initially, the probability to hop from any
clip to any of its neighbors is completely uniform.

Learning takes place by the dynamical strengthening
and weakening of the internal h-values, in correspondence
to an external feedback, i.e. a reward λ, coming from the
environment. Specifically, the update of the h-values is
done according to the following update rule:

$$h^{(t+1)}(c_i, c_j) = h^{(t)}(c_i, c_j) - \gamma \left( h^{(t)}(c_i, c_j) - 1 \right) + \lambda, \qquad (2)$$

where the reward λ is non-negative (λ = 0 implies no
reward), and is added only to the h-values of the edges
that were traversed in the last random walk. The damping
parameter 0 ≤ γ ≤ 1 weakens the h-values of all
edges and allows the agent to forget its past experience,
a useful feature in changing environments [191124].
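
As a minimal sketch of Eqs. (1) and (2), the following Python snippet stores the h-values in a dictionary keyed by edge, damps every edge toward 1, and then rewards the traversed edges. The function names (`hop_probabilities`, `random_walk_step`, `update_h_values`) and the dictionary layout are illustrative assumptions, not part of the PS model's specification.

```python
import random

def hop_probabilities(h, clip):
    """Eq. (1): normalize the h-values of the edges leaving `clip`."""
    out = {dst: w for (src, dst), w in h.items() if src == clip}
    total = sum(out.values())
    return {dst: w / total for dst, w in out.items()}

def random_walk_step(h, clip, rng=random):
    """Sample the next clip according to the hopping probabilities."""
    probs = hop_probabilities(h, clip)
    targets = list(probs)
    return rng.choices(targets, weights=[probs[t] for t in targets])[0]

def update_h_values(h, traversed, reward, gamma):
    """Eq. (2): damp every edge toward 1, then reward the traversed edges."""
    for edge in h:
        h[edge] -= gamma * (h[edge] - 1.0)
    for edge in traversed:
        h[edge] += reward

# Hypothetical example: one percept clip "s" and two action clips "+" and "-".
h = {("s", "+"): 1.0, ("s", "-"): 1.0}
update_h_values(h, traversed=[("s", "+")], reward=1.0, gamma=0.0)
print(hop_probabilities(h, "s"))
```

With zero damping, a single reward on the edge to "+" makes that action twice as likely as "-" on the next visit; a non-zero `gamma` would pull both h-values back toward their initial value of 1 over time.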


III. GENERALIZATION WITHIN PS 


Generalization is usually applicable when the perceptual
input is composed of more than a single category.
In the framework of the PS model, this translates to the
case of K > 1 in percept space. In particular, when
two (or more) stimuli are similar, i.e. share a set of common
features, or, more precisely, have the same values for
some of the categories, it may be useful to process them
in a similar way. Here we enhance the PS model with a
simple but effective generalization mechanism based on
this idea.

The key feature of this mechanism is the dynamical
creation of a class of abstracted clips that we call wildcard
clips. Whenever the agent encounters a new stimulus, the
corresponding new percept clip is created and compared
pairwise to all existing clips. For each pair of clips whose
1 ≤ l ≤ K categories carry different values, a new wildcard
clip is created (if it does not already exist) with all
the l differing values replaced with the wildcard symbol
#. Such a wildcard clip then represents a categorization
based on the remaining K − l common categories.

A wildcard clip with l wildcard symbols is placed in
the l-th layer of the clip network (we consider the percept
clip layer as the zeroth layer). In general, there can be
up to K layers between the layer of percept clips and the
layer of action clips, with $\binom{K}{l}$ wildcard clips in layer l for
a particular percept. From each percept and wildcard
clip there are direct edges to all action clips and to all
matching higher-level wildcard clips¹.

To demonstrate how this mechanism operates we consider
the example from the introduction. An agent acts
as a driver who should learn how to deal with traffic
lights and arrow signs. While driving, the agent sees a
traffic light with an arrow sign and should choose between
two actions ($|A_1| = 2$): continue driving (+) or stop the
car (−). The percepts that the agent perceives are composed
of two categories (K = 2): color and direction.
Each category has two possible values ($|S_1| = |S_2| = 2$):
red and green for the color, and left and right for the
direction. At each time step t the agent thus perceives
one of four possible combinations of colors and arrows,
randomly chosen by the environment, and chooses one of
the two possible actions. In such a setup, the basic PS
agent, described in the previous section, would have a
two-layered network of clips, composed of four percept
clips and two action clips, as shown in Fig. 2. It would
then try to associate the correct action with each of the
four percepts separately. The PS with generalization, on
the other hand, has a much richer playground: it can, in
addition, connect percept clips to intermediate wildcard
clips, and associate wildcard clips with action clips, as
we elaborate below.


FIG. 2: The basic PS network as it is built up for the driver
scenario. Four percept clips (arrow, color) in the first row are
connected to two action clips (+/−) in the second row. Each
percept-action connection is learned independently.

The development of the enhanced PS network is shown
step by step in Fig. 3 for the first four time steps of the
driver scenario (a hypothetical order of percepts is considered
for illustration). When a left-green signal is perceived
at time t = 1, the corresponding percept clip is
created and connected to the two possible actions (+/−)
with an initial weight $h_0$, as shown in Fig. 3(a). In the
second time step, t = 2, a right-green signal is shown.
This time, in addition to the creation of the corresponding
percept clip, the wildcard clip (#, green) is also created,
since both of the encountered percepts are green,
and placed in the first layer of the network². Newly created
edges are shown in Fig. 3(b) as solid lines, whereas
all previously created edges are shown as dashed lines.
Next, at time t = 3, a right-red signal is presented. This
leads to the creation of the (⇒, #) clip, because both
the second and the third percepts have a right arrow.
Moreover, since the first percept does not share any similarities
with the third percept, the full wildcard clip (#, #)
is created and placed in the second layer, as shown in
Fig. 3(c) (depicted as a circle with a single # symbol).
Last, at t = 4, a left-red signal is shown. This causes the
creation of the (⇐, #) clip (the left arrow is shared with
the first percept) and the (#, red) clip (the red color is
shared with the third percept), as shown in Fig. 3(d). After
the fourth time step the network is fully established and
from this point on will only evolve through changes in
the weights of the edges, i.e. by modifying the h-values.

¹ By matching higher-level wildcard clips, we mean wildcard
clips with more wildcard symbols, whose explicit category values
match those of the lower-level wildcard clip. In
essence, a matching higher-level wildcard clip (e.g. the clip
($s_1$, $s_2$, #, #)) further generalizes a lower-level wildcard clip (e.g.
($s_1$, $s_2$, $s_3$, #)).




FIG. 3: The enhanced PS network as it is built up for the
driver scenario, during the first four time steps. The following
sequence of signals is shown: left-green (t = 1), right-green
(t = 2), right-red (t = 3), and left-red (t = 4). Four percept
clips (arrow, color) in the first row are connected to two layers
of wildcard clips (first layer with a single wildcard and second
layer with two) and to two action clips (+/−) in the fourth
row. Newly created edges are solid, whereas existing edges are
dashed (relative weights of the h-values are not represented).
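
The wildcard-creation rule can be sketched in a few lines of Python. The snippet below replays the four-step percept sequence of the driver scenario and reports which clips are created at each step. Representing clips as tuples, and the names `abstract`, `add_percept`, and `layer`, are assumptions made for illustration; edges and h-values are deliberately left out.

```python
WILDCARD = "#"

def abstract(a, b):
    """Replace every category in which clips a and b differ by the wildcard."""
    return tuple(x if x == y else WILDCARD for x, y in zip(a, b))

def add_percept(clips, percept):
    """Add `percept` to `clips` (a list in creation order), together with the
    wildcard clips obtained by comparing it pairwise with every existing clip.
    Returns the list of clips created in this step."""
    if percept in clips:
        return []
    created = [percept]
    for other in list(clips):  # compare only with clips that existed before
        candidate = abstract(percept, other)
        if candidate not in clips and candidate not in created:
            created.append(candidate)
    clips.extend(created)
    return created

def layer(clip):
    """Layer index of a clip: its number of wildcard symbols."""
    return clip.count(WILDCARD)

clips = []
sequence = [("left", "green"), ("right", "green"),
            ("right", "red"), ("left", "red")]
for t, percept in enumerate(sequence, start=1):
    new = add_percept(clips, percept)
    print(t, [(clip, layer(clip)) for clip in new])
```

After the fourth percept the list holds the four percept clips and five wildcard clips, and presenting any of the four percepts again creates nothing new, in line with the statement that the network is then fully established.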

The mechanism we have described so far realizes, by
construction, the first two characteristics of meaningful
generalization: categorization and classification. In particular,
categorization, the ability to recognize common
properties, is achieved by composing the wildcard clips
according to similarities in the incoming input. For example,
it is natural to think of the (#, red) wildcard clip as
representing the common property of redness³. Likewise,
classification, the ability to relate a new stimulus to the
group of similar past stimuli, is fulfilled by connecting
the lower-level wildcard clips to matching higher-level
wildcard clips, as described above (where percept
clips are regarded here as zero-order wildcard clips)⁴.


² To simplify the visualization we draw the wildcard clip (#, green)
as a green circle with no direction (and without the # symbol).
In general, we omit one # symbol in all figures.

³ In that spirit, one could interpret the full wildcard clip (#, #)
as representing a general perceptual input.

⁴ Note that classification is done, therefore, not only on the level
of the percept clips, but also on the level of the wildcard clips.


While categorization and classification are realized by
the very structure of the clip network, the remaining
requirements, namely relevant, correct, and flexible
generalization, are fulfilled via the update of the h-values.
To illustrate that, we next confront the agent with four
different environmental scenarios, one after the other.
Each scenario lasts 1000 time steps, followed by a sudden
change of the reward scheme, to which the agent
has to adapt. The different scenarios are listed below:

(a) At the beginning (1 ≤ t ≤ 1000), the agent is rewarded
for stopping at red light and for driving at
green light, irrespective of the arrow direction.

(b) In the second phase (1000 < t ≤ 2000), the agent is
rewarded for doing the opposite: stopping at green
light and driving at red light.

(c) In the third phase (2000 < t ≤ 3000), the agent
should only follow the arrows: it is rewarded for
driving (stopping) when the arrow points to the
left (right). Colors should thus be ignored.

(d) In the last phase (3000 < t ≤ 4000), the environment
rewards the agent whenever it chooses to drive,
irrespective of both the signal’s color and its arrow.
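
The four phases can be encoded as a single reward function of the time step, the percept, and the chosen action. The sketch below is one possible encoding; the function name `driver_reward`, the string labels, and the convention that phase boundaries fall at t = 1000, 2000, and 3000 (inclusive on the right) are assumptions for illustration.

```python
def driver_reward(t, color, arrow, action, reward=1.0):
    """Reward for `action` ("+" = drive, "-" = stop) at time step t,
    given a percept with `color` in {"red", "green"} and
    `arrow` in {"left", "right"}."""
    if t < 1 or t > 4000:
        raise ValueError("time step outside the four phases")
    if t <= 1000:        # phase (a): stop at red, drive at green
        correct = "-" if color == "red" else "+"
    elif t <= 2000:      # phase (b): the opposite of phase (a)
        correct = "+" if color == "red" else "-"
    elif t <= 3000:      # phase (c): drive on left arrow, stop on right arrow
        correct = "+" if arrow == "left" else "-"
    else:                # phase (d): always drive
        correct = "+"
    return reward if action == correct else 0.0

print(driver_reward(500, "red", "left", "-"))    # phase (a), stopping at red
print(driver_reward(2500, "red", "left", "+"))   # phase (c), driving on left arrow
```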

Fig. 4 sketches four different network configurations
that typically develop during the above phases. Only
strong edges with relatively large h-values are depicted, and
we ignore direct edges from percepts to actions, for clarity,
as explained later. At each stage a different configuration
develops, demonstrating how the relevant wildcard
clips play an important role, via strong connections
to action clips. Moreover, those wildcard clips are connected
strongly to the correct action clips. The relevant
and correct edges are built through the update rule of
Eq. (2), which only strengthens edges that, after having
been traversed, lead to a rewarded action. Finally,
the presented flexibility in the network’s configuration,
which reflects a flexible generalization ability, is due to:
(a) the existence of all possible wildcard clips in the network;
and (b) the update rule of Eq. (2), which allows
the network, through a non-zero damping parameter γ, to
adapt fast to changes in the environment. We note that
Fig. 4 only displays idealized network configurations. In
practice, other strong edges may exist, e.g. direct edges
from percepts to actions, which may be rewarded as well.
In the next Section we address such alternative configurations
and analyze their influence on the agent’s success
rate.

Fig. 5 shows the efficiency, that is, the averaged success
probability, of the PS agent in the driver scenario, as a
function of time, averaged over 10"^ agents. A reward of
λ = 1 is given for correct actions and a damping parameter
of γ = 0.005 is used⁵. It is shown that on average the
agents manage to adapt to each of the phases imposed
by the environment, and to learn the correct actions. We
can also see that the asymptotic efficiency of the agents
is slightly larger in the last phase, where the correct action
is independent of the input⁶. This observation
indicates that the agent’s efficiency increases when the
stimuli can be generalized to a greater extent. We will
encounter this feature once more in Section IV C, where
it is analytically verified.

FIG. 4: The enhanced PS network configurations (idealized)
as built up for each of the four phases of the driver
scenario (see text). Only strong edges are shown. Different
wildcard clips allow the network to realize different generalizations.
Categorization and classification are realized by the
structure of the network, whereas relevance, correctness and
flexibility come about through the update rule of Eq. (2).

FIG. 5: The averaged efficiency of the enhanced PS agents as
simulated for the four different phases of the driver scenario
(see text). At the beginning of each phase, the agent has to
adapt to the new rules of the environment. The efficiency
drops and revives again, thereby exhibiting the mechanism’s
correctness and flexibility. A damping parameter of γ = 0.005
was used, and the average was taken over lO'^ agents.

⁵ As always in the PS model, there is a trade-off between adaptation
time and the maximum averaged efficiency. A high damping
parameter γ leads to faster relearning, but to lower averaged
asymptotic success rates, see also Ref. m. Here we chose
γ = 0.005 to allow the network to adapt within 1000 time steps.


IV. NECESSITY OF GENERALIZATION IN 
LEARNING 

A. A simple example 

Sometimes it is necessary for the agent to have a mechanism
of generalization, as otherwise it has no chance to
learn. Consider, for example, a situation in which the
agent perceives a new stimulus every time step. What
option does it have, other than trying to find some similarities
among those stimuli, upon which it can act? In
this section we consider such a scenario and analyze it
in detail. Specifically, the environment presents one of n
different arrows, but at each time step the background
color is different. The agent can only move into one of the
n > 1 corresponding directions and the environment rewards
the agent whenever it follows the direction of an arrow,
irrespective of its color. We call it the neverending-color
scenario.

FIG. 6: The basic PS network as it is built up in the
neverending-color scenario. Each percept clip in the first row
is independently connected to all n action clips in the second
row. The thickness of the edges does not reflect their weights.

In this scenario, the basic PS agent has a two-layered
clip network, of the structure presented in Fig. 6. At
each time step, a new percept clip is created, from which
the random walk leads, after a single transition, to one
of the n possible action clips the agent can perform. The
problem is that even if the agent takes the correct direction,
the rewarded edge will never take part in later time
steps, as no symbol is shown twice. The basic PS agent
has thus no other option but to choose an action at random,
which will be correct only with probability 1/n,
even after infinitely many time steps. In contrast to the
basic PS, the PS with generalization does show learning
behavior. The full network is shown in Fig. 7. Percept
clips and wildcard clips are connected to matching wildcard
clips and to all actions. Note that the wildcard clips
(#, color) are never created, as no color is seen twice.

⁶ To understand this, note that: (a) The relevant edge can be
rewarded at each time step and thus be less affected by the non-zero
damping parameter; and (b) Each wildcard clip necessarily
leads to the correct action.



FIG. 7: The enhanced PS network as it is built up in the 
neverending-color scenario. Each percept- and wildcard-clip 
is connected to higher-level matching wildcard clips and to all 
n action clips. For clarity, only one-level edges to and from 
wildcard clips are solid, while other edges are semitransparent. 
The thickness of the edges does not reflect their weights. 


To illustrate the fundamental difference in performance
between the basic and the enhanced PS model we consider
their asymptotic efficiencies. As explained above,
the basic PS agent can only be successful with probability
1/n. To see that the enhanced PS agent can do better,
we take a closer look at the (arrow, #) clips. These clips
will, eventually, have very strong edges to the correct action
clip. In fact, in the case of zero damping (γ = 0)
we consider here, the h-values of these edges will tend to
infinity with time, implying that once an (arrow, #) clip
is hit, the probability to hop to the correct action clip
becomes unity. This is illustrated for the left-arrow case
in Fig. 8.

At each time step, the agent is confronted with a certain
colored arrow. The corresponding new percept clip
is created and a random walk on the network begins.
To determine the exact asymptotic efficiency of the enhanced
PS agent, we should consider two possibilities:
Either the corresponding wildcard clip (arrow, #) is hit,
or it is not. In the first case, which occurs with probability
p = 1/(n + 2), the excitation will hop to the correct
action with unit probability and the agent will be rewarded.
In the second case, no action is preferred over
the others and the correct action will be reached with
probability 1/n⁷. Overall, the efficiency of the enhanced
PS agents is thus given by:

$$E_\infty(n) = p + (1 - p)\,\frac{1}{n} = \frac{1 + 2n}{n(n + 2)} > \frac{1}{n}, \qquad p = \frac{1}{n + 2}, \qquad (3)$$

which is independent of the precise value of the reward
λ (as long as it is a positive constant).
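
As a quick check of the asymptotic efficiency, the expression p + (1 − p)/n with p = 1/(n + 2) can be evaluated exactly with rational arithmetic and compared with the 1/n success probability of the basic agent. The function name `asymptotic_efficiency` is an assumption introduced for this sketch.

```python
from fractions import Fraction

def asymptotic_efficiency(n):
    """Asymptotic efficiency p + (1 - p)/n with p = 1/(n + 2)."""
    p = Fraction(1, n + 2)
    return p + (1 - p) * Fraction(1, n)

for n in (2, 3, 5):
    enhanced = asymptotic_efficiency(n)
    basic = Fraction(1, n)
    print(f"n={n}: enhanced={enhanced}, basic={basic}, gap={enhanced - basic}")
```

For n = 2, 3, and 5 this gives 5/8, 7/15, and 11/35, each strictly above 1/n, with a gap that shrinks as n grows.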

In Fig. 9 the average efficiency of the enhanced PS
agents, obtained through numerical simulation, is plotted
as a function of time, in solid red curves, for several

⁷ It is possible that an edge from the full wildcard clip (#, #)
to some action clip was previously rewarded, yet when averaging
over all agents we still get an averaged success probability of 1/n.



FIG. 8: The enhanced PS network as it is built up for the
neverending-color scenario with K = 2 categories. Only the
subnetwork corresponding to the left arrow is shown. The
weight of the edge from the wildcard clip (⇐, #) to the correct
action clip (←) goes to h = ∞ with time. Hopping to the
(⇐, #) clip then leads to the rewarded action with certainty.
Otherwise, hopping randomly to any of the other clips is only
successful with probability 1/n. Edges that are relevant for
the analysis are solid, whereas other edges are semitransparent.
The thickness of the edges does not reflect their weights.


values of n. Initially, the averaged efficiency is 1/n, i.e.
completely random (which is the best performance of the
basic agent). It then grows, indicating that the agents begin
to learn how to respond correctly, until it reaches its
asymptotic value, as given in Eq. (3) and marked in the
figure with a dashed blue line. It is seen that in these
cases, the asymptotic efficiency is achieved already after
tens to hundreds of time steps (see the next Section for an
analytical expression of the learning rate). The simulations
were carried out with 10 ® agents and a zero damping
parameter (γ = 0). Since the asymptotic efficiency of
Eq. (3) is independent of the reward λ, and to ease the
following analysis, we chose a high value of
λ = 1000. Setting a smaller reward would only amount
to a slower learning curve, but with no qualitative difference.


FIG. 9: Learning curves of the enhanced PS agents in the
neverending-color scenario for n = 2, 3 and 5 actions, as a function of the time step.
Simulations over 10® agents are shown in red, where a high reward
of λ = 1000 was used. Asymptotic efficiencies $E_\infty(n)$ (Eq. (3))
are shown in dashed blue. The corresponding analytical
approximation curves (Eq. (4)) are shown in dotted black.


We have therefore shown that the generalization mechanism
leads to a clear qualitative advantage in this scenario:
without it the agent cannot learn, whereas with
it, it can. As for the quantitative difference, both Eq. (3)
and Fig. 9 indicate that the gap in performance is not
high. Nonetheless, any such gap can be further amplified.
The idea is that for any network configuration in
which the probability to take the correct action is larger
than the probability to take any other action, the success
probability can be amplified toward unity by “majority voting”,
i.e. by performing the random walk several times,
and choosing the action clip that occurs most frequently.
Such amplification rapidly increases the agent’s performance
whenever the gap over the fully random agent is
not negligible.
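
To make the amplification argument concrete, the following sketch treats the simplest case of two actions and assumes that the repeated random walks are independent draws from a fixed network, each yielding the correct action with probability p. Under these assumptions the success probability of a strict majority vote over an odd number m of walks is a binomial tail sum. The function name `majority_success` and the choice p = 5/8 (the n = 2 asymptotic efficiency) are illustrative.

```python
from math import comb

def majority_success(p, m):
    """Probability that a strict majority of m independent walks selects the
    correct one of two actions, each walk being correct with probability p.
    m must be odd so that ties cannot occur."""
    if m < 1 or m % 2 == 0:
        raise ValueError("m must be a positive odd integer")
    return sum(comb(m, k) * p**k * (1 - p)**(m - k)
               for k in range(m // 2 + 1, m + 1))

p_single = 5 / 8  # asymptotic efficiency of the enhanced agent for n = 2
for m in (1, 3, 11, 101):
    print(m, majority_success(p_single, m))
```

The vote only helps when the single-walk probability exceeds 1/2 in the two-action case; for more than two actions the analogous condition is that the correct action is the most likely one, and the tail sum would have to be replaced by a multinomial computation.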


B. Analytical analysis 

1. Learning curve 

To analyze the PS learning curves and to predict the
agent’s behavior for arbitrary n, we make the following
simplifying assumptions: First, we assume that all possible
wildcard clips are present in the PS network
from the very beginning; second, we assume that
edges from the partial wildcard clips (arrow, #) to the
full wildcard clip (#, #) are never rewarded; last, we set
an infinite reward λ = ∞. As shown below, under these
assumptions, the analysis results in a good approximation
of the actual performance of the agents.

While Eq. (3) provides an expression for $E_\infty(n)$, the
expected efficiency of the agents at infinity, here we look
for the expected efficiency at any time step $t$, i.e. we
look for $E_t(n)$. Taking the above assumptions into account
and following the same arguments that led to Eq. (3), we
note that at any time $t$ at which the arrow (x) is shown,
there are only two possible network configurations that
are conceptually different: either the edge from the (x, #)
clip to the (x) action clip was already rewarded and
has an infinite weight, or not. Note that while this edge
must eventually be rewarded, at any finite $t$ this is not
guaranteed. Let $p_{\mathrm{learn}}(t)$ be the probability
that the edge from the wildcard clip (x, #) to the action
clip (x) has an infinite $h$-value at time $t$, i.e. the
probability that the correct association was learned. The
expected efficiency at time $t$ is then given by

$$E_t(n) = p_{\mathrm{learn}}(t)\, E_\infty(n) + \left(1 - p_{\mathrm{learn}}(t)\right) \frac{1}{n}. \tag{4}$$

The probability $p_{\mathrm{learn}}(t)$ can be written as

$$p_{\mathrm{learn}}(t) = 1 - \left(1 - \frac{1}{n(n+1)(n+2)}\right)^{t-1}, \tag{5}$$

where the term $1/\left(n(n+1)(n+2)\right)$ corresponds to the
probability of finding the rewarded path (labeled “$\infty$”
in the network figure for this scenario): $1/n$ is the
probability that the environment presents the arrow (x),
the probability to hop from the percept clip (x, color) to
the wildcard clip (x, #) is $1/(n+2)$, and the probability
to hop from the wildcard clip (x, #) to the (x) action clip
is $1/(n+1)$. Finally, we take into account the fact that,
before time $t$, the agent had $(t-1)$ attempts to take this
path and be rewarded.
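
As a rough numerical companion to the two expressions above, the
following Python sketch evaluates the learning probability and the
time-dependent efficiency. It is a minimal illustration, not the
simulation code behind the figures: the function names are hypothetical,
and the asymptotic efficiency is taken in the form $p + (1 - p)/n$ with
$p = 1/(n+2)$, i.e. the two-category case ($K = 2$) of Eq. (7).

```python
def asymptotic_efficiency(n):
    """Two-category asymptotic efficiency: p + (1 - p) / n, p = 1 / (n + 2)."""
    p = 1.0 / (n + 2)
    return p + (1.0 - p) / n


def p_learn(t, n):
    """Probability that the (x, #) -> (x) edge was rewarded before time t."""
    q = 1.0 / (n * (n + 1) * (n + 2))  # chance of taking the rewarded path
    return 1.0 - (1.0 - q) ** (t - 1)


def efficiency(t, n):
    """Expected efficiency at time step t under the simplifying assumptions."""
    learned = p_learn(t, n)
    return learned * asymptotic_efficiency(n) + (1.0 - learned) / n


if __name__ == "__main__":
    for n in (2, 3, 5):
        curve = [efficiency(t, n) for t in (1, 10, 100, 1000)]
        print(n, curve, asymptotic_efficiency(n))
```

At $t = 1$ the sketch returns the random-guess value $1/n$, and for
large $t$ it approaches the asymptotic value, as the two limiting
configurations in the text require.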

The analytical approximation of the time-dependent
efficiency of the PS agents, given in Eq. (4), is plotted on
top of Fig. 9 in dotted black, where it is shown to match
the simulated curves (in red) well. The difference at the
very beginning is caused by the assumption that all wildcard
clips are present in the network from the very beginning,
whereas the real agent needs several time steps to
create them, which reduces its initial success probability.
Nonetheless, after a certain number of time steps
the simulated PS agent starts outperforming the prediction
given by the analytic approximation, because the
agent can be rewarded for the transition from the wildcard
clip (arrow, #) to the full wildcard clip (#, #), leading
to a higher success probability.


2. Learning time 

The best efficiency a single PS agent can achieve is
given by the asymptotic efficiency $E_\infty(n)$ (Eq. (3)).
For each agent there is a certain time $\tau$ at which this
efficiency is achieved, and all agents reach this efficiency
at $t = \infty$. The quantity $p_{\mathrm{learn}}(t)$,
defined before, is the probability that the edge from the
relevant wildcard clip to the correct action clip was
rewarded before time $t$, and can be expressed as a
cumulative distribution function $P(\tau \le t - 1)$, so that
$P(\tau = t) = P(\tau \le t) - P(\tau \le t - 1)
= p_{\mathrm{learn}}(t + 1) - p_{\mathrm{learn}}(t)$.

The expected value of $\tau$ can be thought of as the
learning time of the agents and can be expressed as a power
series

$$E[\tau] = \sum_{t=1}^{\infty} t\, P(\tau = t) = \sum_{t=1}^{\infty} \frac{t}{n(n+1)(n+2)} \left(1 - \frac{1}{n(n+1)(n+2)}\right)^{t-1} = n(n+1)(n+2). \tag{6}$$

Note that the learning time $E[\tau]$ reflects the exponential
decay rate of the “not learning” probability
$(1 - p_{\mathrm{learn}}(t))$ of each agent, as given in
Eq. (5), and thereby also the decay rate in the number of
agents whose network did not yet converge to the best
configuration.
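
The closed form of the learning time can be checked numerically. The
Python sketch below is an illustration with hypothetical function names:
it sums a truncated version of the series for $E[\tau]$, using the
geometric form $P(\tau = t) = q\,(1 - q)^{t-1}$ with
$q = 1/\left(n(n+1)(n+2)\right)$, and compares it with $n(n+1)(n+2)$.
The truncation length is an arbitrary choice that is ample for small $n$.

```python
def learning_time_closed_form(n):
    """Expected learning time n (n + 1) (n + 2)."""
    return n * (n + 1) * (n + 2)


def learning_time_series(n, terms=200_000):
    """Truncated sum of t * P(tau = t), with P(tau = t) = q (1 - q)^(t - 1)."""
    q = 1.0 / (n * (n + 1) * (n + 2))
    total = 0.0
    not_learned = 1.0  # holds (1 - q)^(t - 1) for the current t
    for t in range(1, terms + 1):
        total += t * q * not_learned
        not_learned *= 1.0 - q
    return total


if __name__ == "__main__":
    for n in (2, 3, 5):
        print(n, learning_time_closed_form(n), learning_time_series(n))
```

For $n = 2, 3, 5$ the closed form gives 24, 60 and 210 time steps,
respectively, which illustrates how quickly the learning time grows
with the number of actions.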


Note that the probability $1/n$ that the environment presents
a given arrow explicitly uses the assumption that the number
of actions equals the number of presented arrows ($n$).






C. More than two categories 

We next generalize the neverending-color task to the
case of an arbitrary number of categories $K$. The color
category may take infinitely many values, whereas any other
category can only take finitely many values, and the number
of possible actions is given by $n > 1$. As before, only one
category is important, namely the arrow direction, and
the agent is rewarded for following it, irrespective of all
other input. With more irrelevant categories the environment
thus overloads the agent with more unnecessary
information. Would this affect the agent’s performance?



FIG. 10: The enhanced PS network as it is built up for the
neverending-color scenario with $K = 3$ categories. Only the
subnetwork corresponding to the down-arrow is shown. The
weights of the edges from the two wildcard clips that contain
the down-arrow, in the first and second layer, respectively,
to the correct action clip ($\downarrow$) go to $h = \infty$
with time. Edges that are relevant for the analysis are
solid, whereas other edges are semitransparent. The thickness
of the edges does not reflect their weights.


To answer this question, we look for the corresponding
averaged asymptotic efficiency. As before, in the limit
of $t \to \infty$, the wildcard clips which contain the arrows
lead to a correct action clip with unit probability (for
any finite reward $\lambda$ and zero damping parameter
$\gamma = 0$), as illustrated in Fig. 10. On the other hand,
choosing any other clip (including action clips) results in
the correct action with an averaged probability of only
$1/n$. Accordingly, either the random walk led to a wildcard
clip with an arrow, or not. The averaged asymptotic
efficiency for $K$ categories and $n$ actions can hence be
written as


$$E_\infty(n, K) = p + (1 - p)\,\frac{1}{n} = \frac{n + (1 + n)\,2^{K-2}}{n\left(n + 2^{K-1}\right)}, \tag{7}$$

where $p = \frac{2^{K-2}}{n + 2^{K-1}}$ is the probability to hit
a wildcard clip with an arrow, given by the ratio between the
number of wildcard clips with an arrow ($2^{K-2}$) and the
total number of clips that are reachable from a percept clip
($n + 2^{K-1}$). Note that for two categories Eq. (7)
correctly reduces to Eq. (3).
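
The counting behind $p$ can be reproduced by brute force. The Python
sketch below is an illustration under explicit assumptions, with
hypothetical function names: the color slot of every wildcard clip is
taken to be the wildcard symbol, each of the remaining $K - 1$ slots is
either kept or replaced by a wildcard, the first of those slots is taken
to be the arrow, and the hop from a percept clip is assumed uniform over
all reachable clips. It then compares the resulting efficiency with the
closed form of Eq. (7), using exact rational arithmetic.

```python
from fractions import Fraction
from itertools import product


def count_wildcard_clips(K):
    """Return (total, with_arrow) for the wildcard clips of one percept.

    Each pattern is a tuple of K - 1 flags for the non-color slots:
    True keeps the slot's value, False replaces it by a wildcard.
    Index 0 is assumed to be the arrow slot.
    """
    patterns = list(product([True, False], repeat=K - 1))
    with_arrow = sum(1 for keep in patterns if keep[0])
    return len(patterns), with_arrow


def asymptotic_efficiency(n, K):
    """p + (1 - p) / n, with p obtained from the enumerated clip counts."""
    total, with_arrow = count_wildcard_clips(K)
    p = Fraction(with_arrow, n + total)  # n action clips plus wildcard clips
    return p + (1 - p) * Fraction(1, n)


def closed_form(n, K):
    """Right-hand side of Eq. (7)."""
    return Fraction(n + (1 + n) * 2 ** (K - 2), n * (n + 2 ** (K - 1)))


if __name__ == "__main__":
    for K in (2, 3, 4):
        print(K, count_wildcard_clips(K), asymptotic_efficiency(3, K))
```

For $K = 2$ the enumeration yields two wildcard clips, one of which
contains the arrow, so that $p = 1/(n + 2)$.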


We can now see the effect of having a large number $K$ of
irrelevant categories on the asymptotic efficiency of the
agents. First, it is easy to show that for a fixed $n$,
$E_\infty(n, K)$ increases monotonically with $K$, as
also illustrated in Fig. 11. This means that although
the categories provided by the environment are irrelevant,
the generalization machinery can exploit them to
make a larger number of relevant generalizations, and
thereby increase the agent’s performance. Moreover, for
large $K$, and, more explicitly, for $K \gg \log n$, the
averaged asymptotic efficiency tends to $(1 + 1/n)/2$.
Consequently, when the number of possible actions $n$ is also
large, in which case the performance of the basic agent
would drop to 0, the enhanced PS agents would succeed
with a probability that tends to $1/2$, as shown in Fig. 11
for n = 2 ^®.
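
Both statements, the monotonic growth with $K$ and the large-$K$
limit, can be checked directly from Eq. (7). The following Python sketch
is a minimal illustration with hypothetical function names; the values of
$n$ and the range of $K$ are arbitrary choices made for the example.

```python
from fractions import Fraction


def asymptotic_efficiency(n, K):
    """Closed form of Eq. (7) as an exact fraction."""
    return Fraction(n + (1 + n) * 2 ** (K - 2), n * (n + 2 ** (K - 1)))


def large_K_limit(n):
    """Limit (1 + 1/n) / 2 of Eq. (7) for K much larger than log n."""
    return Fraction(n + 1, 2 * n)


def is_increasing_in_K(n, K_max):
    """Check E(n, K) < E(n, K + 1) for all 2 <= K < K_max."""
    return all(
        asymptotic_efficiency(n, K) < asymptotic_efficiency(n, K + 1)
        for K in range(2, K_max)
    )


if __name__ == "__main__":
    for n in (2, 16):
        gap = large_K_limit(n) - asymptotic_efficiency(n, 40)
        print(n, is_increasing_in_K(n, 40), float(large_K_limit(n)), float(gap))
```

The remaining gap to the limit shrinks roughly like $2^{-K}$, which is
why the condition $K \gg \log n$ suffices.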



FIG. 11: The averaged asymptotic efficiency $E_\infty(n, K)$
for the neverending-color scenario (see Eq. (7)), as a
function of $K$, the number of categories, for n = 2, 2^ .


Similarly, we note that when none of the categories
is relevant, i.e. when the environment is such that the
agent is expected to take the same action irrespective of
the stimulus it receives, the agents perform even better,
with an average asymptotic efficiency of
$E^{*}_\infty(n, K) = \left(1 + 2^{K-1}\right) / \left(n + 2^{K-1}\right)$.
This is because in such a scenario every wildcard clip
eventually connects to the rewarded action. Accordingly,
since each percept clip leads to wildcard clips with high
probability, the correct action clip is likely to be
reached. In fact, in the case of $K \gg \log n$ the
asymptotic efficiency of the enhanced PS actually tends
to 1.
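
The expression for the case without any relevant category follows from
the same counting as Eq. (7). The Python sketch below is an illustration
under the same assumption of a uniform hop from the percept clip over
its $n + 2^{K-1}$ reachable clips; the function names are hypothetical.
It compares the two efficiencies for a few values of $K$.

```python
from fractions import Fraction


def efficiency_one_relevant(n, K):
    """Eq. (7): only the arrow category is relevant."""
    return Fraction(n + (1 + n) * 2 ** (K - 2), n * (n + 2 ** (K - 1)))


def efficiency_none_relevant(n, K):
    """No category is relevant: one fixed action is always rewarded."""
    wildcard_clips = 2 ** (K - 1)
    reachable = n + wildcard_clips
    # Every wildcard clip leads to the rewarded action; a direct hop to an
    # action clip succeeds only for the single rewarded action.
    return Fraction(wildcard_clips + 1, reachable)


if __name__ == "__main__":
    n = 4
    for K in (2, 5, 10, 20):
        print(K, float(efficiency_one_relevant(n, K)),
              float(efficiency_none_relevant(n, K)))
```

In this sketch the efficiency without relevant categories is always the
larger of the two for $n > 1$, and it approaches 1 as $K$ grows.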


Note that in the neverending-color scenario, where no color
is shown twice, all wildcard clips have their color category
fixed to a wildcard symbol #. There are thus $2^{K-1}$
wildcard clips that are connected to each percept clip, where
half of them, i.e. $2^{K-2}$, contain an arrow.


V. CONCLUSION

When the environment confronts an agent with a new
stimulus in each and every time step, the agent has no
chance of coping, unless the presented stimuli have some
common features that the agent can grasp. The recognition
of these common features, i.e. categorization, and
classifying new stimuli accordingly, are the first steps
toward a meaningful generalization as characterized at
the beginning of this paper. We presented a simple
dynamical machinery that enables the PS model to realize
those abilities and showed how the latter requirements of
meaningful generalization - that relevant generalizations
are learned, that correct actions are associated with the
relevant properties, and that the generalization mechanism
is flexible - follow naturally from the PS model itself.
Through numerical and analytical analysis, also for
an arbitrary number of categories, we showed that the
PS agent can then learn even in extreme scenarios where
each percept is presented only once.

The generalization machinery introduced in this paper
enriches the basic PS model, not only in the practical
sense, i.e. that it can handle a larger range of scenarios,
but also in a more conceptual sense. In particular, the
enhanced PS network allows for the emergence of clips that
represent abstractions or abstract properties, like the
redness property, rather than merely remembered percepts
or actions. Moreover, the enhanced PS network is
multilayered and allows for more involved dynamics of the
random walk, which, as we have shown, gives rise to a
more sophisticated behavior of the agent. Yet, although
the clip network may evolve to more complicated structures
than before, the overall model preserves its inherent
simplicity, which enables an analytical characterization of
its performance.

It is worthwhile to note that the model relies on no
a-priori knowledge, except for the “existence of categories”.
In an embodied framework, this “knowledge” is given
to the agent by its sensory apparatus, which induces the
structure of its percept space. The presented generalization
machinery then builds upon this structure by considering
all possible subsets of common categories, as explained
in the text.

This systematic approach, which is successful in realizing
all possible, flexible, generalizations, has, however,
two limitations. First, it may potentially create an
exponential number of intermediate wildcard clips, which may
limit the agent’s efficiency, as some of the wildcard clips
may harm the agent’s performance (consider, for example,
the full wildcard clip (#, ..., #) in the neverending-color
scenario, when only the arrow is relevant). Second,
the presented mechanism cannot lead to the recognition
of non-trivial correlations. In particular, it induces
generalization only along predefined categories and only
when there are two categories or more. For example, the
agent can, under no circumstances, associate the same
action with the ($\Rightarrow$, blue) and ($\Leftarrow$,
yellow) percepts jointly through a single wildcard clip,
even though this may be required by some environment (each
clip can only be associated separately with the same action).
Similarly, generalization could, in principle, be possible
also when there is just a single category. Consider, for
example, the color dimension. Although it represents a single
category in the percept space, it is still possible (and
sometimes required) to split it further, e.g. into dark and
light colors. A single perceptual category is then divided
into two emergent groups, thereby freeing the agent from the
enforced structure of its own percept space. This is,
however, not yet feasible within the presented setup.

To overcome these limitations, it is intriguing to think
of a machinery that is more stochastic and less systematic
in realizing generalization. In such a machinery,
a self-organizing clip network, including a clip-deletion
mechanism, will take the place of the well-ordered layered
network. In addition, the structured wildcard clips
will be replaced by “empty clips” whose semantics are not
predefined, but rather emerge from the network itself.
Ideally, the envisioned machinery will not even rely on
the knowledge of categories and will therefore be completely
free of any a-priori knowledge. One can, however,
expect that such a mechanism will require more time to
find and sustain meaningful generalization, thereby slowing
down the learning speed of the agent. This is the subject
of ongoing work.


ACKNOWLEDGMENTS 

We wish to thank Markus Tiersch, Dan Browne and 
Elham Kashefi for helpful discussions. This work was 
supported in part by the Austrian Science Fund (FWF) 
through project F04012, and by the Templeton World 
Charity Foundation (TWCF). 


[1] R. S. Sutton and A. G. Barto, Reinforcement learning: 
An introduction (MIT press, 1998). 

[2] S. J. Russell and P. Norvig, Artificial intelligence: a mod¬ 
ern approach (Prentice Hall Englewood Cliffs, 2010), 3rd 
ed. 

[3] M. Wiering and M. van Otterlo (Eds.), Reinforcement 
learning: State of the Art (Springer, 2012). 

[4] C. J. C. H. Watkins, Ph.D. thesis, Cambridge University, 
Cambridge, England (1989). 

[5] G. A. Rummery and M. Niranjan, On-line Q-learning 


using connectionist systems (University of Cambridge, 
1994). 

[6] J. S. Albus, Journal of Dynamic Systems, Measurement, 
and Control 97, 220 (1975). 

[7] R. S. Sutton, Advances in neural information processing 
systems 8, 1038 (1996). 

[8] M. Ponsen, M. E. Taylor, and K. Tuyls, in Adaptive and 
Learning Agents, edited by M. E. Taylor and K. Tuyls 
(Springer Berlin Heidelberg, 2010), vol. 5924 of Lecture 
Notes in Computer Science, chap. 1, pp. 1-32. 



10 


[9] J. Boyan and A. Moore, in Neural Information Processing 
Systems 1 (The MIT Press, Cambridge, MA, 1995), pp. 
369-376. 

[10] S. Whiteson and P. Stone, The Journal of Machine Learn¬ 
ing Research 7, 877 (2006). 

[11] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Ve- 
ness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. 
Fidjeland, G. Ostrovski, et ah. Nature 518, 529 (2015). 

[12] C. Cortes and V. Vapnik, Machine learning 20, 273 
(1995). 

[13] J. Laumonier, in Proceedings of the national eonference 
on artificial intelligence (Menlo Park, CA; Cambridge, 
MA; London; AAAI Press; MIT Press; 1999, 2007), 
vol. 22, pp. 1882-1883. 

[14] D. Ernst, P. Geurts, and L. Wehenkel, Journal of Ma¬ 
chine Learning Research 6, 503 (2005). 

[15] T. Hester and P. Stone, in Proceedings of The 8th Inter¬ 
national Conferenee on Autonomous Agents and Multia¬ 
gent Systems (International Foundation for Autonomous 
Agents and Multiagent Systems, 2009), vol. 2, pp. 717- 
724. 

[16] J. H. Holland, Progress in Theoretical Biology 4, 263 
(1976). 


[17] J. H. Holland, Machine Learning 2, 593 (1986). 

[18] R. J. Urbanowicz and J. H. Moore, Journal of Artificial 
Evolution and Applications pp. 1-25 (2009). 

[19] H. J. Briegel and G. De las Cuevas, Scientific reports 2 

( 2012 ). 

[20] R. Motwani and P. Raghavan, Randomized Algorithms 
(Cambridge University Press, New York, NY, USA, 
1995), chap. 6. 

[21] 1. G. Sinai, Probability Theory: An Introductory Course 
(Springer-Verlag, 1992), chap. 6. 

[22] R. Pfeiffer and C. Scheier, Understanding intelligence 
(MIT Press, Cambridge Massachusetts, 1999), 1st ed. 

[23] G. D. Paparo, V. Dunjko, A. Makmal, M. A. Martin- 
Delgado, and H. J. Briegel, Phys. Rev. X 4, 031002 
(2014). 

[24] J. Mautner, A. Makmal, D. Manzano, M. Tiersch, and 
H. J. Briegel, New Generation Computing 33, 69 (2014). 

[25] A. A. Melnikov, A. Makmal, and H. J. Briegel, Art. Int. 
Research 3 (2014), arXiv: 1405.5459. 

[26] R. Bellman, Dynamic Programming, Rand Corporation 
research study (Princeton University Press, 1957). 



